convert : BailingMoE : fix qkv split when head_dim is 0 #12687
| Thanks a lot for fixing this so quickly! I tested it and can confirm that it fixes the issues I experienced with Ling-lite-base GGUF conversion. Unfortunately, the resulting GGUF doesn't seem to load in llama.cpp:
root@AI:/apool/llama.cpp/build/bin# ./llama-cli -m /mradermacher/tmp/quant/Ling-lite-base.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 5016 (fb8c6eb4) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23241 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23260 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 367 tensors from /mradermacher/tmp/quant/Ling-lite-base.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bailingmoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Ling Lite Base
llama_model_loader: - kv   3:                         general.size_label str              = 64x1.5B
llama_model_loader: - kv   4:                            general.license str              = mit
llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   6:                     bailingmoe.block_count u32              = 28
llama_model_loader: - kv   7:                  bailingmoe.context_length u32              = 16384
llama_model_loader: - kv   8:                bailingmoe.embedding_length u32              = 2048
llama_model_loader: - kv   9:             bailingmoe.feed_forward_length u32              = 5632
llama_model_loader: - kv  10:            bailingmoe.attention.head_count u32              = 16
llama_model_loader: - kv  11:         bailingmoe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                  bailingmoe.rope.freq_base f32              = 600000.000000
llama_model_loader: - kv  13: bailingmoe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  14:               bailingmoe.expert_used_count u32              = 6
llama_model_loader: - kv  15:            bailingmoe.attention.key_length u32              = 0
llama_model_loader: - kv  16:          bailingmoe.attention.value_length u32              = 0
llama_model_loader: - kv  17:                          general.file_type u32              = 1
llama_model_loader: - kv  18:            bailingmoe.rope.dimension_count u32              = 128
llama_model_loader: - kv  19:               bailingmoe.rope.scaling.type str              = none
llama_model_loader: - kv  20:       bailingmoe.leading_dense_block_count u32              = 0
llama_model_loader: - kv  21:                      bailingmoe.vocab_size u32              = 126464
llama_model_loader: - kv  22:      bailingmoe.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  23:            bailingmoe.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  24:                    bailingmoe.expert_count u32              = 64
llama_model_loader: - kv  25:             bailingmoe.expert_shared_count u32              = 2
llama_model_loader: - kv  26:             bailingmoe.expert_weights_norm bool             = true
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = bailingmoe
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,126464]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,126464]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,125824]  = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 126080
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 126081
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 126081
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   85 tensors
llama_model_loader: - type  f16:  282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 31.30 GiB (16.00 BPW) 
load: special tokens cache size = 262
load: token to piece cache size = 0.8056 MB
print_info: arch             = bailingmoe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 16384
print_info: n_embd           = 2048
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 0
print_info: n_embd_head_v    = 0
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 0
print_info: n_embd_v_gqa     = 0
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5632
print_info: n_expert         = 64
print_info: n_expert_used    = 6
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = none
print_info: freq_base_train  = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 16384
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 16B
print_info: model params     = 16.80 B
print_info: general.name     = Ling Lite Base
print_info: n_layer_dense_lead   = 0
print_info: n_ff_exp             = 1408
print_info: n_expert_shared      = 2
print_info: expert_weights_scale = 1.0
print_info: expert_weights_norm  = 1
print_info: vocab type       = BPE
print_info: n_vocab          = 126464
print_info: n_merges         = 125824
print_info: BOS token        = 126080 '<|startoftext|>'
print_info: EOS token        = 126081 '<|endoftext|>'
print_info: EOT token        = 126081 '<|endoftext|>'
print_info: PAD token        = 126081 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 126081 '<|endoftext|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 32054.45 MiB
...........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 600000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.48 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init: failed to allocate buffer for kv cache
llama_init_from_model: failed to initialize the context: failed to initialize self-attention cache
common_init_from_params: failed to create context with model '/mradermacher/tmp/quant/Ling-lite-base.gguf'
main: error: unable to load model | 
| 
 Sigh, indeed, it's because the base class does this: llama.cpp/convert_hf_to_gguf.py Lines 257 to 259 in a6f32f0 
 I'll see if we can work around that. | 
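For anyone reading the log above, here is a small sketch of why the zero `key_length` matters. The numbers are copied from the log; the relationship `n_embd_k_gqa = n_embd_head_k * n_head_kv` is the usual grouped-query-attention bookkeeping and is stated here as an assumption rather than a trace of the loader code. I have not followed the allocation path itself, so treat the link to "failed to allocate buffer for kv cache" as the likely explanation, not a verified one.

```python
# Values copied from the loader log above.
n_embd = 2048      # bailingmoe.embedding_length
n_head = 16        # bailingmoe.attention.head_count
n_head_kv = 4      # bailingmoe.attention.head_count_kv
key_length = 0     # bailingmoe.attention.key_length as written to the GGUF

# Assumed relationship: per-layer K width is head size times KV head count.
n_embd_head_k = key_length                # log prints n_embd_head_k = 0
n_embd_k_gqa = n_embd_head_k * n_head_kv  # log prints n_embd_k_gqa  = 0

# What the values would be if the head size were derived from n_embd / n_head.
expected_head_dim = n_embd // n_head            # matches rope.dimension_count
expected_k_gqa = expected_head_dim * n_head_kv

print(n_embd_k_gqa, expected_head_dim, expected_k_gqa)  # 0 128 512
```

With a per-layer K/V width of 0, the cache for all 28 layers would have zero size, which is consistent with the failure at `init: failed to allocate buffer for kv cache`.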
| 
 Actually, you know what, setting  See PR#2 | 
| @ngxson I still think it's worth it to merge this as it's slightly nicer, even though the model itself is what needs fixing. | 
I don't have time to test this model, but seems good to me

      n_kv_head = self.hparams.get("num_key_value_heads")
      n_embd = self.hparams["hidden_size"]
    - head_dim = self.hparams.get("head_dim", n_embd // n_head)
    + head_dim = self.hparams.get("head_dim") or n_embd // n_head
Yeah, sometimes we have an issue where models exported by transformers have some keys set to None. I will discuss with the @huggingface team to see if this can be removed in the next version.
@CISC I checked with the transformers team; they said that the None value is actually set by custom code outside of the library.
More importantly, I almost forgot that we actually have Model.find_hparam in convert_hf_to_gguf.py, which is perfect for handling such a case. If you have time, can you do a pass to change the places that currently use .get to find_hparam?
That method is useful if you have multiple candidates for a value, but I don't see how it applies here?
The issue is not None, but that they set an actual 0.
Hmm ok I misunderstood the function. But now I think it would be nice if find_hparam can handle this case, maybe add an arg default_value to it?
I'm thinking about this because it was also the case for gemma 3; it was a bit messy because some params are either missing or null in the config.json. It is possible that many future models will behave the same way.
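A rough sketch of what such a lookup with a default could look like, written as a free function over a plain dict. Everything here is hypothetical: the function name, the `default_value` and `treat_falsy_as_missing` parameters, and the behavior are illustrations of the suggestion, not the existing `find_hparam` in convert_hf_to_gguf.py.

```python
from typing import Any, Iterable, Mapping

_MISSING = object()  # sentinel so that None can still be passed as a default


def find_hparam_with_default(
    hparams: Mapping[str, Any],
    keys: Iterable[str],
    default_value: Any = _MISSING,
    treat_falsy_as_missing: bool = False,
) -> Any:
    """Return the first usable value among `keys` (hypothetical helper).

    A key counts as unusable when it is absent or set to None, and also
    when it is falsy (for example 0) if `treat_falsy_as_missing` is set.
    """
    keys = tuple(keys)
    for key in keys:
        value = hparams.get(key)
        if value is None:
            continue
        if treat_falsy_as_missing and not value:
            continue
        return value
    if default_value is _MISSING:
        raise KeyError(f"could not find any of: {keys}")
    return default_value


cfg = {"hidden_size": 2048, "num_attention_heads": 16, "head_dim": 0}
fallback = cfg["hidden_size"] // cfg["num_attention_heads"]

# A null or missing key falls back; a literal 0 does so only when asked.
print(find_hparam_with_default({"head_dim": None}, ["head_dim"], fallback))
print(find_hparam_with_default(cfg, ["head_dim"], fallback))
print(find_hparam_with_default(cfg, ["head_dim"], fallback,
                               treat_falsy_as_missing=True))
```

Keeping the falsy handling opt-in avoids silently replacing hyperparameters for which 0 or False is a valid setting.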
Missed second head_dim usage in #12678.
Cleaner assignment as well.
Edit: The Ling-lite-base model is still broken though until PR#2 is merged.
@bartowski1182 @nicoboss